How to Detect Normally Distributed Data in Linear Regression Analysis
When you conduct data analysis using linear regression, there are several assumptions that must be met. We need to fulfill these assumptions to ensure that the estimation results are consistent and unbiased.
One of the key assumptions when using the OLS (Ordinary Least Squares) method in linear regression is that the residuals must be normally distributed.
This raises a common question: is the method for detecting normality in linear regression the same as in other parametric statistical analyses? In this article, I will discuss how to detect data normality specifically in linear regression analysis using the OLS method.
Do We Test the Residuals or the Raw Data?
The first question that often arises when performing a normality test in linear regression is this: should we test the residuals or the raw data?
As I mentioned earlier, one of the assumptions of the OLS method in linear regression is that the residuals must be normally distributed. That already gives us the answer: in regression analysis, what we test is the residuals, not the raw observations.
Therefore, when conducting a normality test in regression, we are referring to the distribution of the residuals. Now let’s take a look at what residuals are and how they differ from raw data.
What Is a Residual?
Residuals in regression analysis need to be calculated first. They are different from the raw data or observations that we collect, whether from cross-sectional data or time series.
By definition, a residual is the difference between the actual observed value and the predicted value. Some also describe it as the difference between the actual Y and the predicted Y. In other words, it’s based on the dependent variable.
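Written as a formula, the definition looks like this. The notation is my own assumption for illustration: y_i is the actual value of the dependent variable for observation i, ŷ_i is its predicted value, and e_i is the residual.

```latex
e_i = y_i - \hat{y}_i, \qquad i = 1, \dots, n
```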
How to Calculate Residuals
Based on the definition above, calculating residuals requires two types of data: actual Y values and predicted Y values.
The actual data come from the dependent variable, which may be cross-sectional data collected in the field or time series data.
Next, the predicted Y values are obtained after we run a linear regression using the available dependent and independent variables. From the regression output, we will get an intercept and coefficients for each independent variable.
Using those coefficients, we can then calculate the predicted Y values. From both the actual and predicted Y, we can compute the residuals.
So, if your study uses 200 samples, you will also get 200 residual values. That means residuals must be calculated for each observation or sample in the dataset.
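The steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming NumPy is available and using a simulated dataset: the two independent variables, the "true" coefficients, and all variable names are made up for the example and do not come from any real study.

```python
import numpy as np

# Simulated data for illustration only: 200 observations, 2 independent variables.
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Add a column of ones so the model includes an intercept.
X_design = np.column_stack([np.ones(n), X])

# Estimate the intercept and coefficients by ordinary least squares.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predicted Y for every observation, then residual = actual Y - predicted Y.
y_pred = X_design @ coef
residuals = y - y_pred

print("Intercept and coefficients:", coef)
print("Number of residuals:", residuals.size)
```

With 200 observations the script produces 200 residuals, one per observation, which is the series that goes into the normality test.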
How to Detect Residual Normality
To detect whether the residuals are normally distributed or not in a regression model, we can use statistical tests or visual tools such as histograms.
However, most researchers prefer statistical tests because they give a clear decision rule based on a p-value, whereas reading a histogram depends on visual judgment. To support your findings, you can also create a histogram to visualize the distribution of the residuals and see whether it follows a normal curve.
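For the visual check, here is a rough sketch of a text-based histogram that compares the observed counts with the counts a normal curve would predict. It assumes NumPy, uses simulated residuals as a stand-in for your own, and the function name histogram_vs_normal is hypothetical.

```python
import numpy as np

def histogram_vs_normal(residuals, bins=10):
    """Return bin centers, observed counts, and counts expected under a normal curve."""
    residuals = np.asarray(residuals, dtype=float)
    counts, edges = np.histogram(residuals, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    width = edges[1] - edges[0]

    # Normal density with the same mean and standard deviation as the residuals.
    mu, sigma = residuals.mean(), residuals.std(ddof=1)
    density = np.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    # Approximate expected count per bin: density at the center * bin width * n.
    expected = density * width * residuals.size
    return centers, counts, expected

# Simulated residuals for illustration only; replace with your own series.
rng = np.random.default_rng(0)
example_residuals = rng.normal(size=200)

for center, obs, exp in zip(*histogram_vs_normal(example_residuals)):
    bar = "#" * int(obs)
    print(f"{center:7.2f} | {bar:<60} observed={int(obs):3d} expected~{exp:5.1f}")
```

If the bars rise and fall roughly in line with the expected counts, the residuals look approximately bell-shaped. This is a supporting check, not a replacement for the formal test.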
Two common statistical tests for checking normality are the Shapiro-Wilk test and the Kolmogorov-Smirnov test.
Both tests often lead to the same conclusion, although the Shapiro-Wilk test is generally the more powerful of the two, especially in small samples. Here’s a simple way to state the hypotheses for the normality test:
H₀: Residuals are normally distributed
H₁: Residuals are not normally distributed
Once the hypotheses are defined, we need to set a significance level (alpha). For example, let’s use an alpha of 5%. The decision rule would be as follows:
If p-value > 0.05, do not reject the null hypothesis
If p-value ≤ 0.05, reject the null hypothesis in favor of the alternative hypothesis
So, if your Shapiro-Wilk or Kolmogorov-Smirnov test gives a p-value of 0.200, you do not reject the null hypothesis at the 5% level. In other words, the test finds no evidence against normality, so the residuals can be treated as normally distributed.
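The decision rule can be sketched in Python as follows. This is a minimal example assuming NumPy and SciPy are installed; the function name check_residual_normality and the simulated residuals are hypothetical. One caveat: the Kolmogorov-Smirnov call below plugs in the mean and standard deviation estimated from the same residuals, so its p-value is only approximate (the Lilliefors variant of the test is designed for that situation).

```python
import numpy as np
from scipy import stats

def check_residual_normality(residuals, alpha=0.05):
    """Run Shapiro-Wilk and Kolmogorov-Smirnov tests and apply the decision rule."""
    residuals = np.asarray(residuals, dtype=float)

    # Shapiro-Wilk test: H0 = residuals are normally distributed.
    sw_stat, sw_p = stats.shapiro(residuals)

    # Kolmogorov-Smirnov test against a normal distribution whose mean and
    # standard deviation are estimated from the residuals (approximate p-value).
    mu, sigma = residuals.mean(), residuals.std(ddof=1)
    ks_stat, ks_p = stats.kstest(residuals, "norm", args=(mu, sigma))

    return {
        "shapiro_p": float(sw_p),
        "ks_p": float(ks_p),
        # Reject H0 when p-value <= alpha; otherwise do not reject H0.
        "shapiro_reject": bool(sw_p <= alpha),
        "ks_reject": bool(ks_p <= alpha),
    }

# Simulated residuals for illustration only; replace with your own series.
rng = np.random.default_rng(7)
result = check_residual_normality(rng.normal(size=200))
print(result)
```

A value of False under shapiro_reject or ks_reject means the corresponding test did not reject the null hypothesis at the chosen alpha.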
Closing Remarks
After reading this article, I hope readers will better understand the assumption of normally distributed residuals in linear regression analysis using the OLS method.
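Meeting this assumption supports valid hypothesis tests (t and F) and confidence intervals for the regression coefficients, especially in small samples. The Best Linear Unbiased Estimator (BLUE) property itself rests on the other classical OLS assumptions rather than on normality.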
To test for residual normality, you can use the Shapiro-Wilk or Kolmogorov-Smirnov test.
Alright, that’s all I can share in this article. I hope it’s useful and provides new insights for those who need it. See you again in the next article from Kanda Data.